Skip to content

perf/chunktransform#3722

Open
d-v-b wants to merge 22 commits intozarr-developers:mainfrom
d-v-b:perf/codec-chain
Open

perf/chunktransform#3722
d-v-b wants to merge 22 commits intozarr-developers:mainfrom
d-v-b:perf/codec-chain

Conversation

@d-v-b
Copy link
Copy Markdown
Contributor

@d-v-b d-v-b commented Feb 26, 2026

This PR defines a CodecChain object, which is similar to CodecPipeline except that it performs no orchestration. It just defines array-spec-aware, pure-compute encoding and decoding routines for collections of codecs. Part of #3720

depends on #3721

edit: CodecChain is now called ChunkTransform

d-v-b and others added 4 commits February 25, 2026 18:33
Introduces CodecChain, a frozen dataclass that chains array-array,
array-bytes, and bytes-bytes codecs with synchronous encode/decode
methods. Pure compute only -- no IO, no threading, no batching.

Also adds sync roundtrip tests for individual codecs (blosc, gzip,
zstd, crc32c, bytes, transpose, vlen) and CodecChain integration tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Feb 26, 2026
@d-v-b d-v-b mentioned this pull request Feb 26, 2026
@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Feb 27, 2026

updating this PR to blend the codecchain logic with array_spec logic to define a ChunkTransform object that carries both together.

@d-v-b d-v-b changed the title perf/codec chain perf/chunktransform Mar 16, 2026
@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Mar 16, 2026

this PR is necessary for the broader performance plan because it allows us to pipeline codecs that support synchronous execution (i.e., all of them, practically). We can use the ChunkTransform class to "fuse" the non-IO part of chunk encoding into a single callable.

this differs from the CodecPipeline because that data structure does orchestration and IO, which is a separate layer of complexity.

@d-v-b d-v-b requested a review from dcherian March 16, 2026 21:19
@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Mar 26, 2026

@TomAugspurger any interest in reviewing this PR? It's part of a series that culminate in the performance improvements evident in #3719

Copy link
Copy Markdown
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big +1 to the general design and the implementation looks nice. Just minor nitpicks that can be addressed / ignored as you prefer.

ArrayArrayCodec transforms — i.e. the spec that feeds the
ArrayBytesCodec.

All codecs must implement ``SupportsSyncCodec``. Construction will
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idle thought without having yet looked at the implementation: make Codec generic over SupportsSyncCodec (via a protocol) so that this can be caught before runtime?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to avoid touching the codec ABC, so maybe we defer this for a later effort

self._bb_codecs = bb

@property
def shape(self) -> tuple[int, ...]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are shape and dtype ultimately used? They're a bit complicated to understand. Presumably you need them for your full perf PR, but I wanted to confirm that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence about these attributes actually. I wanted to model the fact that the output of a ChunkTransform is an array, with a fixed shape and dtype, and the fact that the array -> array codecs can be thought of as layers of ChunkTransform objects. But I don't know if we actually need this. Maybe a richer return type annotation is a better way of conveying this information.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these attributes are gone, we can add them later if they are actually valuable

"""
bb_out: Any = chunk_bytes
for bb_codec in reversed(self._bb_codecs):
bb_out = bb_codec._decode_sync(bb_out, self._ab_spec) # type: ignore[attr-defined]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment about why the type: ignore is needed? Presumably it relates to that isinstance(c, SupportsSyncCodec) check above, which mypy can't see here in decode?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the type: ignore is gone thanks to making the protocol generic.

for aa_codec, spec in reversed(self._aa_codecs):
ab_out = aa_codec._decode_sync(ab_out, spec) # type: ignore[attr-defined]

return ab_out # type: ignore[no-any-return]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type: ignore I don't understand. Is the Any up above accurate, or should this be NDBuffer | Buffer like I see here? And if it supposed to be NDBuffer | Buffer why do we declare we return -> NDBuffer

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was because the protocol for synchronous encoding / decoding was not generic over input and output types. I fixed that, so the type: ignore statement is gone

Comment on lines +53 to +59
def test_encode_decode_roundtrip_bytes_only(self) -> None:
# Minimal round-trip: BytesCodec serializes the array to bytes and back.
# No compression, no AA transform.
arr = np.arange(100, dtype="float64")
spec = _make_array_spec(arr.shape, arr.dtype)
chain = ChunkTransform(codecs=(BytesCodec(),), array_spec=spec)
nd_buf = _make_nd_buffer(arr)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I'd be fine with consolidating these tests with the construction tests. I do like having focused construction tests when the constructors are complicated, but these seem simple enough that just seeing the traceback pointing to __post_init__ should be enough. Either works for me though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea. I kept the encoding / decoding separate from the constructor tests, but I also made the constructor tests much more compact.

Comment on lines +61 to +64
encoded = chain.encode(nd_buf)
assert encoded is not None
decoded = chain.decode(encoded)
np.testing.assert_array_equal(arr, decoded.as_numpy_array())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth refactoring this to a test helper, assert_encode_decode_equal(...).

Also, not worth worrying about currently, arr is possibly not a NumPy array, depending on what default_buffer_prototype() returns, in which case np.testing.assert_array_equal might not work. But to solve that more generally is out of scope.


if aa_out is None:
return None
bb_out: Any = self._ab_codec._encode_sync(aa_out, self._ab_spec) # type: ignore[attr-defined]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite hard to read/review: "is the output of aa after application of ab really bb?!" Doesn't help that output is Any despite all the fancy typing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you suggest

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aa_out -> asarray; bb_out -> asbytes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok so it's an issue with the variable names, I will see if I can make them more clear

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the properties are an issue too but I don't have a suggestion for that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[points accusingly at the v3 spec] making 3 different kinds of functions all "codecs" was maybe not the best choice. We could use the "filters, serializer, compressors" trinity used in create_array, but I worry that this is a bit too far from the language used by the spec.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to punt on this for now. I legit think the naming issues here are a basic flaw in the v3 spec.

Comment on lines +169 to +170
if bb_out is None:
return None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be hoisted out of the loop

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe not? I'm confused regardless.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whole thing with encoding potentially returning none is problematic and needs to go. Unfortunately that's based on the codec ABC which I am not touching right now :(

Comment on lines +160 to +161
if aa_out is None:
return None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be moved out of the loop? Can _encode_sync return None?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can _encode_sync return None?

unfortunately yes, in keeping with methods already defined on the Codec API.

@d-v-b
Copy link
Copy Markdown
Contributor Author

d-v-b commented Mar 27, 2026

thanks for the super helpful feedback @TomAugspurger and @dcherian! I made a lot of simplifying changes. LMK if anything else needs to be done, otherwise I will merge and move on to the next step of the performance work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs release notes Automatically applied to PRs which haven't added release notes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants